Abstract

We propose a Bayesian evidence framework to facilitate transfer learning from pre-trained deep convolutional neural networks (CNNs). Our framework is formulated on top of a least squares SVM (LS-SVM) classifier, which is simple and fast in both training and testing, and achieves competitive performance in practice. The regularization parameters in LS-SVM are estimated automatically, without grid search or cross-validation, by maximizing the evidence, which is also a useful measure for selecting the best performing CNN out of multiple candidates for transfer learning. The evidence is optimized efficiently by employing Aitken's delta-squared process, which accelerates the convergence of the fixed point update. The proposed Bayesian evidence framework also provides a good solution for identifying the best ensemble of heterogeneous CNNs through a greedy algorithm. Our Bayesian evidence framework for transfer learning is tested on 12 visual recognition datasets and consistently achieves state-of-the-art performance in terms of prediction accuracy and modeling efficiency.

1. Introduction 

Image representations from deep CNN models trained for specific image classification tasks turn out to be powerful even for general purposes [2, 6, 7, 21, 23] and useful for transfer learning or domain adaptation. Therefore, CNNs trained on specific problems or datasets are often fine-tuned to facilitate training for new tasks or domains [2, 6, 13, 15, 16, 36], and an even simpler approach, the application of off-the-shelf classification algorithms such as SVM to the representations from deep CNNs [7], is becoming more attractive in many computer vision problems. However, fine-tuning an entire deep network still requires a lot of effort and resources, and SVM-based methods also involve time-consuming grid search and cross-validation to identify good regularization parameters. In addition, when multiple pre-trained deep CNN models are available, it is unclear which pre-trained models are appropriate for target tasks and which classifiers would maximize accuracy and efficiency. Unfortunately, most existing techniques for transfer learning or domain adaptation are limited to empirical analysis or ad-hoc, application-specific approaches.

Figure 1. We address the problem of selecting the best CNN out of multiple candidates, as shown in this figure (a Bayesian LS-SVM is trained for each candidate without cross-validation, and the best model is selected by evidence). Additionally, our algorithm is capable of identifying the best ensemble of multiple CNNs to further improve performance.

We propose a simple but effective algorithm for transfer learning from pre-trained deep CNNs based on Bayesian least squares SVM (LS-SVM), which is formulated with the Bayesian evidence framework [18, 29] and LS-SVM [26]. This approach automatically determines regularization parameters in a principled way, and shows performance comparable to the standard SVMs based on hinge loss or squared hinge loss. More importantly, Bayesian LS-SVM provides an effective solution to select the best CNN out of multiple candidates and identify a good ensemble of heterogeneous CNNs for performance improvement. Figure 1 illustrates our approach. We also propose a fast Bayesian LS-SVM, which maximizes the evidence more efficiently based on Aitken's delta-squared process [1].

One may argue against the use of LS-SVM for classification because the least squares loss function in LS-SVM tends to penalize well-classified examples. However, least squares loss is often used for training multi-layer perceptrons [4] and shows performance comparable to SVMs [28, 37]. In addition, Bayesian LS-SVM provides a technically sound formulation with outstanding performance in terms of speed and accuracy for transfer learning with deep representations. Considering simplicity and accuracy, we claim that our fast Bayesian LS-SVM is a reasonable choice for transfer learning with deep learning representations in visual recognition problems. Based on this approach, we achieved promising results compared to the state-of-the-art techniques on 12 visual recognition tasks.

The rest of this paper is organized as follows. Section 2 describes examples of transfer learning or domain adaptation based on pre-trained CNNs for visual recognition problems. Then, we discuss the Bayesian evidence framework applicable to the same problem in Section 3 and its acceleration technique using Aitken's delta-squared process in Section 4. The performance of our algorithm in various applications is demonstrated in Section 5.

2. Related Work 

Since AlexNet [17] demonstrated impressive performance in the ImageNet large scale visual recognition challenge (LSVRC) 2012, a few deep CNNs with different architectures, e.g., VGG [25] and GoogLeNet [27], have been proposed in the subsequent events. Instead of training deep CNNs from scratch, some researchers have attempted to refine pre-trained networks for new tasks or datasets by updating the weights of all neurons, or have adopted the intermediate outputs of existing deep networks as generic visual feature descriptors. These strategies can be interpreted as transfer learning or domain adaptation.

Refining a pre-trained CNN is called fine-tuning, where the architecture of the network may be preserved while weights are updated based on new training data. Fine-tuning is generally useful to improve performance [2, 6, 13, 36] but requires careful implementation to avoid overfitting. The second approach regards the pre-trained CNNs as feature extraction machines and combines the deep representations with off-the-shelf classifiers such as linear SVM [7, 34], logistic regression [7, 34], and multi-layer neural networks [21]. The techniques in this category have been successful in many visual recognition tasks [2, 23, 24].

When combining a classification algorithm with image representations from pre-trained deep CNNs, we often face a critical issue. Although several deep CNN models trained on large scale image repositories are publicly available, there is no principled way to select a CNN out of multiple candidates and find the best ensemble of multiple CNNs for performance optimization. Existing algorithms typically rely on ad-hoc methods for model selection and fail to provide clear evidence for superior performance [2].

3. Bayesian LS-SVM for Model Selection 

This section discusses a Bayesian evidence framework to select the best CNN model(s) when multiple transferable candidates are available and to identify a reasonable regularization parameter for the LS-SVM classifier automatically.

3.1. Problem Definition and Formulation 

Suppose that we have a set of pre-trained deep CNN models denoted by $\{\mathrm{CNN}_m \mid m = 1 \dots M\}$. Our goal is to identify the best performing deep CNN model among the $M$ networks for transfer learning. A naive approach is to fine-tune each network for the target task, which requires substantial effort for training. Another option is to replace some of the fully connected layers in a CNN with an off-the-shelf classifier such as SVM and check the performance on the target task through parameter tuning for each network, which would also be computationally expensive.

We adopt a Bayesian evidence framework based on LS-SVM to achieve the goal in a principled way, where the evidence of each network is maximized iteratively and the maximum evidences are used to select a reasonable model. During the evidence maximization procedure, the regularization parameter of LS-SVM is identified automatically without time-consuming grid search and cross-validation. In addition, the Bayesian evidence framework is also applied to the construction of an ensemble of multiple CNNs to accomplish further performance improvement.

3.2. LS-SVM 

We deal with a multi-label or multi-class classification problem, where the number of categories is $K$. Let $\mathcal{D} = \{(x_n, y_n^{(k)}) \mid k = 1 \dots K\}_{n=1 \dots N}$ be a training set, where $x_n \in \mathbb{R}^D$ is a feature vector and $y_n^{(k)}$ is a binary variable that is set to 1 if label $k$ is given to $x_n$ and 0 otherwise. Then, for each class $k$, we minimize a least squares loss with an $L_2$ regularization penalty as follows:

$$\min_{w^{(k)}} \; \|y^{(k)} - X^\top w^{(k)}\|^2 + \lambda^{(k)} \|w^{(k)}\|^2, \qquad (1)$$

where $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$ and $y^{(k)} = [y_1^{(k)}, \dots, y_N^{(k)}]^\top \in \mathbb{R}^N$. The optimal solution of the problem in (1) is given by

$$w^{(k)} = U (S + \lambda^{(k)} I)^{-1} U^\top X y^{(k)}, \qquad (2)$$

where $USU^\top$ is the eigen-decomposition of $XX^\top$ and $I$ is an identity matrix. This regularized least squares approach has the clear benefit that it requires only one eigen-decomposition of $XX^\top$ to obtain the solution in (2) for all combinations of $\lambda^{(k)}$ and $y^{(k)}$.
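As a minimal sketch of how the closed-form solution in (2) reuses a single eigen-decomposition, the following NumPy snippet solves the regularized least squares problem for several labels and regularization values. The function names `eig_gram` and `ls_svm_weights`, the toy data, and the tested values of $\lambda$ are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np

def eig_gram(X):
    """Eigen-decomposition of X X^T for X of shape (D, N); computed once."""
    s, U = np.linalg.eigh(X @ X.T)
    return U, np.clip(s, 0.0, None)  # clip tiny negative round-off values

def ls_svm_weights(U, s, X, y, lam):
    """Closed-form solution w = U (S + lam I)^{-1} U^T X y."""
    h = U.T @ (X @ y)
    return U @ (h / (s + lam))

rng = np.random.default_rng(0)
X = rng.random((5, 40))                          # toy data: D = 5, N = 40
Y = (rng.random((40, 3)) > 0.5).astype(float)    # K = 3 binary label columns

U, s = eig_gram(X)                               # shared by every (k, lam) pair
W = {(k, lam): ls_svm_weights(U, s, X, Y[:, k], lam)
     for k in range(3) for lam in (0.1, 1.0, 10.0)}
```

Each additional label or value of $\lambda$ then costs only matrix-vector products, which is the property the Bayesian procedure relies on.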


3.3. Bayesian Evidence Framework

The optimization of the regularized least squares formulation presented in (1) is equivalent to the maximization of the posterior with fixed hyperparameters $\alpha$ and $\beta$, denoted by $p(w \mid y, X, \alpha, \beta)$, where $\lambda = \alpha/\beta$. The posterior can be decomposed into two terms by Bayes' theorem as

$$p(w \mid y, X, \alpha, \beta) \propto p(y \mid X, w, \beta)\, p(w \mid \alpha), \qquad (3)$$

where $p(y \mid X, w, \beta)$ corresponds to the Gaussian observation noise model given by

$$p(y \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^\top w, \beta^{-1}) \qquad (4)$$

and $p(w \mid \alpha)$ denotes a zero-mean isotropic Gaussian prior as

$$p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I). \qquad (5)$$

Note that we dropped the superscript $(k)$ for notational simplicity from the equations in this subsection.

In the Bayesian evidence framework [18, 29], the evidence, also known as marginal likelihood, is a function of the hyperparameters $\alpha$ and $\beta$ as

$$p(y \mid X, \alpha, \beta) = \int p(y \mid X, w, \beta)\, p(w \mid \alpha)\, dw. \qquad (6)$$

Under the probabilistic model assumptions corresponding to (4) and (5), the log evidence $\mathcal{L}(\alpha, \beta)$ is given by

$$\mathcal{L}(\alpha, \beta) = \log p(y \mid X, \alpha, \beta) = \frac{D}{2} \log \alpha + \frac{N}{2} \log \beta - \frac{1}{2} \log |A| - \frac{\beta}{2} \|y - X^\top m\|^2 - \frac{\alpha}{2} m^\top m - \frac{N}{2} \log 2\pi, \qquad (7)$$

where the precision matrix and mean vector of the posterior $p(w \mid y, X, \alpha, \beta) = \mathcal{N}(w \mid m, A^{-1})$ are given respectively by

$$A = \alpha I + \beta XX^\top \quad \text{and} \quad m = \beta A^{-1} X y.$$

The log evidence $\mathcal{L}(\alpha, \beta)$ is maximized by repeatedly alternating the following fixed point update rules

$$\alpha = \frac{\gamma}{m^\top m}, \qquad \beta = \frac{N - \gamma}{\|y - X^\top m\|^2}, \qquad (8)$$

which involve $\gamma$ given by

$$\gamma = \sum_{d=1}^{D} \frac{\beta s_d}{\alpha + \beta s_d}, \qquad (9)$$

where $\{s_d\}_{d=1}^{D}$ are eigenvalues of $XX^\top$. Note that $m$ and $\gamma$ should be re-estimated after each update of $\alpha$ and $\beta$.

Another pair of update rules for $\alpha$ and $\beta$ can be derived by an expectation-maximization (EM) technique, but these procedures are substantially slower than the fixed point update rules in (8).

Through the optimization procedures described above, we determine the regularization parameter $\lambda = \alpha/\beta$. Although the estimated parameters are not optimal, they may still be reasonable solutions since they are obtained by maximizing the marginal likelihood in (6).
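The log evidence in (7) can be checked numerically. The sketch below, with hypothetical function names and toy data, evaluates the log evidence from the posterior precision $A$ and mean $m$ and compares it with the log density of $y$ under its marginal Gaussian distribution, assuming the noise model in (4) and the prior in (5).

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """Log evidence written with the posterior precision A and mean m."""
    D, N = X.shape
    A = alpha * np.eye(D) + beta * (X @ X.T)      # posterior precision
    m = beta * np.linalg.solve(A, X @ y)          # posterior mean
    r = y - X.T @ m                               # residual at the mean
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta)
            - 0.5 * logdet_A - 0.5 * beta * (r @ r)
            - 0.5 * alpha * (m @ m) - 0.5 * N * np.log(2 * np.pi))

def log_marginal_direct(X, y, alpha, beta):
    """Same quantity from y ~ N(0, I / beta + X^T X / alpha)."""
    N = X.shape[1]
    C = np.eye(N) / beta + (X.T @ X) / alpha
    _, logdet_C = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet_C + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(0)
X = rng.random((6, 50))                           # toy data: D = 6, N = 50
y = (rng.random(50) > 0.5).astype(float)
ev = log_evidence(X, y, alpha=2.0, beta=5.0)
direct = log_marginal_direct(X, y, alpha=2.0, beta=5.0)
```

The first form works with a $D \times D$ matrix, whereas the direct form needs an $N \times N$ covariance, so the first is preferable when $N$ is much larger than $D$.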


3.4. Model Selection using Evidence 

The evidence computed in the previous subsection is for a single class, and the overall evidence for all classes, denoted by $\mathcal{L}^*$, is obtained by the summation of the evidences from individual classes, which is given by

$$\mathcal{L}^* = \sum_{k=1}^{K} \mathcal{L}(\alpha^{(k)}, \beta^{(k)}). \qquad (12)$$

We compute the overall evidence corresponding to each deep CNN model, and choose the model with the maximum evidence for transfer learning. We expect that the selected model performs best among all candidates, which will be verified in our experiment.

In addition, when an ensemble of deep CNNs needs to be constructed for a target task, our approach selects a subset of good pre-trained CNNs in a greedy manner. Specifically, we add the network with the largest evidence in each stage and test whether the augmented network improves the evidence or not. The network is accepted if the evidence increases, and rejected otherwise. After the last candidate is tested, we obtain the final network combination and its associated model learned with the concatenated feature descriptors from the accepted networks.
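The greedy construction described above can be sketched as follows. This is one possible reading of the procedure, not the authors' code: candidates are visited in decreasing order of their individual evidence, and `evidence_fn` is a hypothetical callable that returns the overall evidence of a model trained on the concatenated features of the given networks. The lookup table used here is made-up toy data.

```python
def greedy_ensemble(candidates, evidence_fn):
    """Greedily grow an ensemble, accepting a network only if evidence increases."""
    # Visit candidates from the largest to the smallest individual evidence.
    order = sorted(candidates, key=lambda c: evidence_fn([c]), reverse=True)
    selected = [order[0]]
    best = evidence_fn(selected)
    for c in order[1:]:
        trial = evidence_fn(selected + [c])
        if trial > best:              # accept the augmented network
            selected.append(c)
            best = trial
    return selected, best

# Toy evidence table (hypothetical numbers, for illustration only).
toy = {frozenset("a"): 3.0, frozenset("b"): 2.0, frozenset("c"): 1.0,
       frozenset("ab"): 2.5, frozenset("ac"): 4.0}
selected, best = greedy_ensemble(["a", "b", "c"], lambda names: toy[frozenset(names)])
```

With $M$ candidates this evaluates the evidence for at most $2M - 1$ feature sets, instead of the $2^M - 1$ subsets an exhaustive search would need.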
4. Fast Bayesian LS-SVM

The Bayesian evidence framework discussed in Section 3 is useful to identify a good CNN for transfer learning and a reasonable regularization parameter. To make this framework even more practical, we present a faster algorithm to accomplish the same goal and a new theory that guarantees the convergence of the algorithm.

4.1. Reformulation of Evidence 

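We are going to reduce $\mathcal{L}(\alpha, \beta)$ to a function with only one parameter that directly corresponds to the regularization parameter $\lambda = \alpha/\beta$. To this end, we rewrite $\mathcal{L}(\alpha, \beta)$ by using the eigen-decomposition $XX^\top = USU^\top$ as

$$\mathcal{L}(\alpha, \beta) = \frac{D}{2} \log \alpha + \frac{N}{2} \log \beta - \frac{1}{2} \sum_{d=1}^{D} \log(\alpha + \beta s_d) - \frac{\beta}{2} y^\top y + \frac{\beta^2}{2} \sum_{d=1}^{D} \frac{h_d^2}{\alpha + \beta s_d} - \frac{N}{2} \log 2\pi, \qquad (13)$$

where $s_d$ is the $d$-th diagonal element in $S$ and $h_d$ denotes the $d$-th element in $h = U^\top X y$. Then, we re-parameterize $\mathcal{L}(\alpha, \beta)$ into $\mathcal{F}(\lambda, \beta)$ as

$$\mathcal{F}(\lambda, \beta) = \frac{D}{2} \log \lambda + \frac{N}{2} \log \beta - \frac{1}{2} \sum_{d=1}^{D} \log(\lambda + s_d) - \frac{\beta}{2} \left( y^\top y - \sum_{d=1}^{D} \frac{h_d^2}{\lambda + s_d} \right) - \frac{N}{2} \log 2\pi. \qquad (14)$$

The derivative of $\mathcal{F}(\lambda, \beta)$ with respect to $\beta$ is given by

$$\frac{\partial \mathcal{F}}{\partial \beta} = \frac{N}{2\beta} - \frac{1}{2} \left( y^\top y - \sum_{d=1}^{D} \frac{h_d^2}{\lambda + s_d} \right),$$

and we obtain the following equation by setting this derivative to zero.

$$\beta = \frac{N}{y^\top y - \sum_{d=1}^{D} h_d^2 / (\lambda + s_d)}. \qquad (15)$$

Finally, we obtain a one-dimensional function of the log evidence by plugging (15) into (14), which is given by

$$\mathcal{F}(\lambda) = \frac{D}{2} \log \lambda - \frac{1}{2} \sum_{d=1}^{D} \log(\lambda + s_d) - \frac{N}{2} \log \left( y^\top y - \sum_{d=1}^{D} \frac{h_d^2}{\lambda + s_d} \right) + \frac{N}{2} \log N - \frac{N}{2} - \frac{N}{2} \log 2\pi. \qquad (16)$$

Figure 2 illustrates the curvature of this log evidence function with respect to $\log \lambda$.

Figure 2. Plot of the log evidence $\mathcal{F}(\lambda)$ with respect to $\log \lambda$. Note that $\mathcal{F}(\lambda)$ is neither convex nor concave.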
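The reduction to a single parameter can be verified numerically. The sketch below is an illustration under stated assumptions: it substitutes $\alpha = \lambda\beta$ into the log evidence, eliminates $\beta$ in closed form, and checks the result against the marginal likelihood computed directly. The function names, the toy data, and the value $\lambda = 0.3$ are hypothetical.

```python
import numpy as np

def profile_log_evidence(lam, s, h, N, yty):
    """One-dimensional log evidence with beta eliminated in closed form."""
    D = s.size
    resid = yty - np.sum(h**2 / (lam + s))        # y^T y - sum_d h_d^2 / (lam + s_d)
    return (0.5 * D * np.log(lam) - 0.5 * np.sum(np.log(lam + s))
            - 0.5 * N * np.log(resid) + 0.5 * N * np.log(N)
            - 0.5 * N - 0.5 * N * np.log(2 * np.pi))

def log_marginal_direct(X, y, alpha, beta):
    """Log density of y ~ N(0, I / beta + X^T X / alpha), used as a reference."""
    N = X.shape[1]
    C = np.eye(N) / beta + (X.T @ X) / alpha
    _, logdet_C = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet_C + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(1)
X = rng.random((4, 30))                           # toy data: D = 4, N = 30
y = (rng.random(30) > 0.5).astype(float)
N = X.shape[1]
s, U = np.linalg.eigh(X @ X.T)
h = U.T @ (X @ y)
yty = float(y @ y)

lam = 0.3
beta_star = N / (yty - np.sum(h**2 / (lam + s)))  # maximizer over beta for this lam
F = profile_log_evidence(lam, s, h, N, yty)
L = log_marginal_direct(X, y, lam * beta_star, beta_star)
```

Only `s`, `h`, `N`, and `yty` enter the profiled function, so after the eigen-decomposition each evaluation costs $O(D)$.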

4.2. New Fixed-point Update Rule 

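We now derive a new fixed point update rule and present a sufficient condition for the existence of a fixed point. The stationary points of (16) with respect to $\lambda$ satisfy

$$\frac{D}{2\lambda} - \frac{1}{2} \sum_{d=1}^{D} \frac{1}{\lambda + s_d} - \frac{N}{2} \cdot \frac{\sum_{d=1}^{D} h_d^2 / (\lambda + s_d)^2}{y^\top y - \sum_{d=1}^{D} h_d^2 / (\lambda + s_d)} = 0, \qquad (17)$$

and we update the fixed point by maximizing (16) as

$$\lambda = \frac{\left( \sum_{d=1}^{D} \frac{s_d}{\lambda + s_d} \right) \left( y^\top y - \sum_{d=1}^{D} \frac{h_d^2}{\lambda + s_d} \right)}{N \sum_{d=1}^{D} \frac{h_d^2}{(\lambda + s_d)^2}}. \qquad (18)$$

As illustrated in Figure 2 and in the supplementary file, $\mathcal{F}(\lambda)$ in (16) is neither convex nor concave. However, we can show a sufficient condition for the existence of the fixed point using the following theorem.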

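Theorem 1. Denote the update rule in (18) by $f(\lambda)$. If $y$ is a binary variable and $x_n$ is an $L_2$ normalized nonnegative vector, then $f(\lambda)$ has a fixed point.

Proof. We first show that $f(\lambda)$ is asymptotically linear in $\lambda$ as

$$\lim_{\lambda \to \infty} \frac{f(\lambda)}{\lambda} = \frac{\|y\|^2 \|X\|_F^2}{N \|Xy\|^2}.$$

Since $y$ is binary and $x_n$ is $L_2$ normalized and nonnegative, we can derive the following two relations.

$$\|y\|^2 \|X\|_F^2 = PN, \qquad (19)$$

$$\|Xy\|^2 = \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m x_n^\top x_m \geq P, \qquad (20)$$

where $P = \sum_{n=1}^{N} y_n$, and the inequality in (20) is strict unless all positive examples are mutually orthogonal. From (19) and (20), it is shown that

$$\|y\|^2 \|X\|_F^2 < N \|Xy\|^2.$$

Obviously, $f(0) > 0$ and there exists a $\lambda^+$ such that $f(\lambda^+) < \lambda^+$. The intermediate value theorem implies the existence of $\lambda^*$ such that $f(\lambda^*) = \lambda^*$, where $0 < \lambda^* < \lambda^+$, as illustrated in Figure 3. □

The fixed point is unique if $f(\lambda)$ is concave. Although it has always been concave in our observations, we have no proof yet and leave it as future work.

Figure 3. Aitken's delta-squared process. The fixed point update function $f(\lambda)$ is approximated by the green dashed line, and its intersection with $y = \lambda$ becomes the next update point.

Algorithm 1 Fast Bayesian Least Squares
  Input: $X \in \mathbb{R}^{D \times N}$ and $y \in \mathbb{R}^{N}$
  Output: Optimal solutions $(w, \lambda)$.
  Initialize $\lambda$  // e.g., $\lambda = 1$
  $(U, S) \leftarrow$ eigen-decomposition($XX^\top$)
  $s \leftarrow \mathrm{diag}(S)$, $h \leftarrow U^\top X y$
  repeat
    $\lambda_0 \leftarrow \lambda$
    $\lambda_1 \leftarrow$ UPDATE($\lambda_0$, $s$, $h$, $N$, $y^\top y$)
    $\lambda_2 \leftarrow$ UPDATE($\lambda_1$, $s$, $h$, $N$, $y^\top y$)
    $\lambda \leftarrow \lambda_0 - \frac{(\lambda_1 - \lambda_0)^2}{(\lambda_2 - \lambda_1) - (\lambda_1 - \lambda_0)}$
    if $\lambda < 0$ or $\lambda = \pm\infty$ then
      $\lambda \leftarrow \lambda_2$
    end if
  until $|\lambda - \lambda_0| < \epsilon$  // $\epsilon$ is a small convergence tolerance
  $w \leftarrow U (S + \lambda I)^{-1} h$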
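Algorithm 1 can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' MATLAB code: the function names, the toy data, the relative tolerance, and the iteration cap are assumptions. In particular, the closed form inside `update` is an assumed form of the fixed point rule, obtained by setting the derivative of the one-dimensional log evidence to zero and solving for the leading $\lambda$.

```python
import numpy as np

def update(lam, s, h, N, yty):
    """One fixed point step f(lam); assumed form of the update rule (see lead-in)."""
    gamma = np.sum(s / (lam + s))                 # effective number of parameters
    resid = yty - np.sum(h**2 / (lam + s))        # equals N / beta at this lam
    return gamma * resid / (N * np.sum(h**2 / (lam + s) ** 2))

def fast_bayesian_ls(X, y, lam=1.0, eps=1e-10, max_iter=100):
    """Aitken-accelerated fixed point search for lam, then the ridge solution."""
    D, N = X.shape
    s, U = np.linalg.eigh(X @ X.T)                # done once, reusable for every label
    h = U.T @ (X @ y)
    yty = float(y @ y)
    for _ in range(max_iter):
        lam0 = lam
        lam1 = update(lam0, s, h, N, yty)
        lam2 = update(lam1, s, h, N, yty)
        denom = (lam2 - lam1) - (lam1 - lam0)
        lam = lam0 - (lam1 - lam0) ** 2 / denom if denom != 0 else np.inf
        if not np.isfinite(lam) or lam < 0:       # failure cases: fall back to lam2
            lam = lam2
        if abs(lam - lam0) < eps * max(1.0, lam0):
            break
    w = U @ (h / (s + lam))                       # w = U (S + lam I)^{-1} h
    return w, lam

rng = np.random.default_rng(0)
X = rng.random((8, 200))                          # toy nonnegative features
X /= np.linalg.norm(X, axis=0)                    # L2-normalize each column
y = (X[0] > np.median(X[0])).astype(float)        # toy binary labels
w, lam = fast_bayesian_ls(X, y)
```

For a multi-label problem, the eigen-decomposition would be hoisted out of the function and shared across labels, so that only the scalar iteration is repeated per class.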

4.3. Speed Up Algorithm 

We accelerate the fixed point update rule in (18) by using Aitken's delta-squared process [1]. Figure 3 illustrates Aitken's delta-squared process. Let us focus on the two points $(\lambda_0, f(\lambda_0))$ and $(\lambda_1, f(\lambda_1))$, and the line going through these two points. The equation of this line is

$$y = \lambda_1 + (\lambda - \lambda_0) \frac{\lambda_2 - \lambda_1}{\lambda_1 - \lambda_0}, \qquad (21)$$

where $f(\lambda_0)$ and $f(\lambda_1)$ are replaced by $\lambda_1$ and $\lambda_2$, respectively. The idea behind Aitken's method is to approximate the fixed point $\lambda^*$ using the intersection of the line in (21) with the line $y = \lambda$, which is given by

$$\lambda = \lambda_0 - \frac{(\lambda_1 - \lambda_0)^2}{(\lambda_2 - \lambda_1) - (\lambda_1 - \lambda_0)}. \qquad (22)$$

Our fast Bayesian learning algorithm for the regularized least squares problem in (1) is summarized in Algorithm 1. In our algorithm, we first compute the eigen-decomposition of $XX^\top$. This is the most time-consuming part but needs to be performed only once since the result can be reused for every label in $y$. After that, we obtain the regularization parameter $\lambda$ through an iterative procedure.

Algorithm 2 $\lambda$ = UPDATE($\lambda$, $s$, $h$, $N$, $y^\top y$)
  Return $f(\lambda)$, the fixed point update rule in (18), evaluated from $s$, $h$, $N$, and $y^\top y$.

When we apply Aitken's delta-squared process, we have two potential failure cases, as shown in Figure 4 (left) and (right). The first case often arises if the initial $\lambda_0$ is far from the fixed point $\lambda^*$, and the second case occurs when the approximating line in (21) is parallel to $y = \lambda$. Fortunately, these failures rarely happen in practice and can be handled easily by skipping the procedure in (22) and updating $\lambda$ with $\lambda_2$.

Figure 4. Two failure cases of Aitken's delta-squared process. (left) The first case arises if the initial $\lambda_0$ is far from the fixed point $\lambda^*$, which results in $\lambda < 0$. (right) The second case occurs when the approximating line (dashed green) is parallel to $y = \lambda$, where $\lambda = \pm\infty$.

Figure 5 demonstrates the relative convergence rates of three different techniques: Aitken's delta-squared process in Algorithm 1, the fixed point update rules in (8), and the EM update method. Aitken's delta-squared process converges significantly faster than the others.
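To illustrate the acceleration in isolation, the following sketch applies the extrapolation in (22) to a generic scalar fixed point problem and counts function evaluations. The test function $g(x) = \cos x$, the starting point, the tolerance, and the function names are illustrative assumptions unrelated to the evidence function.

```python
import math

def plain_fixed_point(g, x, tol=1e-10, max_iter=1000):
    """Plain iteration x <- g(x); returns the estimate and the number of g calls."""
    evals = 0
    for _ in range(max_iter):
        x_new = g(x)
        evals += 1
        if abs(x_new - x) < tol:
            return x_new, evals
        x = x_new
    return x, evals

def aitken_fixed_point(g, x, tol=1e-10, max_iter=1000):
    """Two plain steps followed by Aitken's delta-squared extrapolation."""
    evals = 0
    for _ in range(max_iter):
        x0 = x
        x1 = g(x0)
        x2 = g(x1)
        evals += 2
        denom = (x2 - x1) - (x1 - x0)
        # Fall back to the plain update when the extrapolation is undefined.
        x = x0 - (x1 - x0) ** 2 / denom if denom != 0 else x2
        if abs(x - x0) < tol:
            return x, evals
    return x, evals

x_plain, n_plain = plain_fixed_point(math.cos, 1.0)
x_aitken, n_aitken = aitken_fixed_point(math.cos, 1.0)
```

Each accelerated step costs two evaluations of the update function, so the gain comes from needing far fewer steps overall.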

































Figure 5. Comparison between Aitken's delta-squared process, fixed point update rules, and EM update rules on the PASCAL VOC 2012 dataset (class = aeroplane). Aitken's delta-squared process is significantly faster than the other methods.


5. Experiments 

We present the details of our experimental setting and the performance of our algorithm compared to the state-of-the-art techniques on 12 visual recognition benchmark datasets.

5.1. Datasets and Image Representation 

The benchmark datasets involve various visual recognition tasks such as object recognition, photo annotation, scene recognition, fine-grained recognition, visual attribute detection, and action recognition. Table 1 presents the characteristics of the datasets. In our experiment, we followed the given train and test split and evaluation measure of each dataset. For the datasets with bounding box annotations, such as CUB200-2011, UIUC object attribute, Human attribute, and Stanford 40 actions, we enlarged the bounding boxes by 150% to consider neighborhood context as suggested in [23, 2].

For deep learning representations, we selected 4 pre-trained CNNs from the Caffe Model Zoo: GoogLeNet [31], VGG19 [25], and AlexNet [17] trained on ImageNet, and GoogLeNet trained on Places [31]. As generic image representations, we used the 4096-dimensional activations of the first fully connected layer in VGG19 and AlexNet and the 1024-dimensional vector obtained from the global average pooling layer located right before the final softmax layer in GoogLeNet.

Our implementation is in MATLAB 2011a, and all experiments were conducted on a quad-core Intel(R) Core(TM) i7-3820 @ 3.60GHz processor.

5.2. Bayesian LS-SVM vs. SVM 

We first compare the performance of our Bayesian LS-SVM with the standard SVM when they are applied to deep CNN features for visual recognition problems. We used only a single image scale, 256 × 256, in this experiment. The LIBLINEAR [10] package is used for SVM training and the regularization parameters are selected by grid search with cross-validation.

Table 2 presents the complete results of our experiment. Bayesian LS-SVM is competitive with SVM in terms of prediction accuracy even with significantly reduced training time. Training SVM becomes slower than Bayesian LS-SVM as the number of classes increases, so it is particularly slow on the Caltech 256 and SUN 397 datasets.

Another notable observation in Table 2 is that the order of prediction accuracy is highly correlated with the evidence. This means that the model selected by Bayesian LS-SVM produces reliable testing accuracy, and a proper deep learning image representation is obtained without time-consuming grid search and cross-validation. Note that cross-validation in LS-SVM and SVM plays the same role, but is less reliable and slower than our Bayesian evidence framework. The capability to select the appropriate CNN model and the corresponding regularization parameter is one of the most important properties of our algorithm.
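The relation between evidence and accuracy can be checked on a few rows of Table 2. The sketch below copies the evidence values and the Bayesian LS-SVM accuracies of three datasets and compares the model chosen by evidence with the most accurate one. The rank correlation helper is a plain Spearman coefficient for data without ties, and it is assumed that the evidence values within one dataset share the same power-of-ten factor, so only the leading digits are used.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation for two equally long lists without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

names = ["G_I", "G_P", "V", "A"]
# (evidence leading digits, Bayesian LS-SVM accuracy) per CNN, copied from Table 2
table2 = {
    "PASCAL VOC 2007": ([46.9, 38.6, 48.0, 32.5], [85.2, 73.8, 85.8, 75.0]),
    "MIT Indoor": ([30.1, 35.2, 31.1, 28.6], [66.0, 79.9, 73.1, 61.1]),
    "SUN-397": ([12.8, 13.2, 12.9, 12.7], [47.0, 60.1, 53.7, 44.9]),
}

summary = {}
for dataset, (evidence, acc) in table2.items():
    by_evidence = names[evidence.index(max(evidence))]
    by_accuracy = names[acc.index(max(acc))]
    summary[dataset] = (by_evidence, by_accuracy, spearman(evidence, acc))
```

In these three examples the network with the largest evidence is also the most accurate one, which is the behavior the model selection step depends on.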

5.3. Comparison with Other Methods 

We now show that our Bayesian LS-SVM identifies a combination of multiple CNNs to improve accuracy without grid search and cross-validation. For each task, we select a subset of the 4 pre-trained CNNs in a greedy manner; we add CNNs to our selection, one by one, until the evidence does not increase. Our algorithm is compared with DeCAF [7], Zeiler [34], INRIA [21], KTH-S [23], KTH-FT [2], VGG [25], Zhang [35, 36], and TUBFI [3]. In addition, our ensembles identified by greedy evidence maximization are compared with the oracle combinations (the ones with the highest accuracy on the test set, found by exhaustive search) and the best combinations found by exhaustive evidence maximization.

Table 3 shows that our ensemble approach achieves the best performance in most of the 12 tasks. The ensembles identified by the greedy approach are consistent with the selections by exhaustive evidence maximization and even with the oracle selections* made by testing accuracy maximization. Note that our network selections are natural and reasonable; GoogLeNet-ImageNet and VGG19 are selected frequently, while GoogLeNet-Place is preferred to GoogLeNet-ImageNet in MIT Indoor and SUN-397 since these datasets are constructed for scene recognition. It turns out that the proposed algorithm tends to choose the networks with higher accuracies in the target task even though it makes selections based only on the evidence in a greedy manner. An interesting observation is that our result is less consistent with the selections by oracle and exhaustive evidence maximization in the Stanford 40 Actions dataset, where GoogLeNet-Place seems to provide complementary information even with its low accuracy and is helpful to improve recognition performance. This is probably because actions are frequently performed at typical places, e.g., a fair portion of the images in the brushing teeth class are taken in bathrooms.

* This option is practically impossible since it requires evaluation with the test dataset using all available models for model selection.

Table 1. Characteristics of the 12 datasets. $N_1$: number of training data, $N_2$: number of test data, $K$: number of classes, $L$: average number of labels per image, AP: average precision, Acc.: accuracy, AUC: area under the ROC curve.

| Dataset | Task | N1 | N2 | K | L | Box | Measure |
|---|---|---|---|---|---|---|---|
| PASCAL VOC 2007 [8] | object recognition | 5011 | 4952 | 20 | 1.5 | | mean AP |
| PASCAL VOC 2012 [9] | object recognition | 5717 | 5823 | 20 | 1.5 | | mean AP |
| Caltech 101 [12] | object recognition | 3060 | 6086 | 102 | 1 | | mean Acc. |
| Caltech 256 [14] | object recognition | 15420 | 15187 | 257 | 1 | | mean Acc. |
| ImageCLEF 2011 [20] | photo annotation | 8000 | 10000 | 99 | 11.9 | | mean AP |
| MIT Indoor Scene [22] | scene recognition | 5360 | 1340 | 67 | 1 | | mean Acc. |
| SUN 397 Scene [32] | scene recognition | 19850 | 19850 | 397 | 1 | | mean Acc. |
| CUB 200-2011 [30] | fine-grained recognition | 5994 | 5794 | 200 | 1 | yes | mean Acc. |
| Oxford Flowers [19] | fine-grained recognition | 2040 | 6149 | 102 | 1 | | mean Acc. |
| UIUC object attributes [11] | attribute detection | 6340 | 8999 | 64 | 7.1 | yes | mean AUC |
| Human attributes [5] | attribute detection | 4013 | 4022 | 9 | 1.8 | yes | mean AP |
| Stanford 40 actions [33] | action recognition | 4000 | 5532 | 40 | 1 | yes | mean AP |

Table 2. Bayesian LS-SVM versus SVM. Without a time-consuming cross-validation procedure, Bayesian LS-SVM achieves prediction accuracy competitive with SVM. In addition, Bayesian LS-SVM selects the proper CNN for each task by using the evidence (see bold-faced numbers). Best accuracy in LS-SVM and SVM denotes the maximum achievable accuracy on the test dataset using all available learned models. Note that the model selected by the Bayesian evidence framework or cross-validation may not be the best one in testing. The following sets of regularization parameters are tested for cross-validation in LS-SVM and SVM, respectively: 2“®,2®, 2^°} and {0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10}. (G_I: GoogLeNet-ImageNet, G_P: GoogLeNet-Place, V: VGG19, and A: AlexNet.) All cross-validation results use 5 folds.

| Dataset | CNN | LS-SVM best acc. | Bayesian acc. | Evidence (leading digits; power-of-ten factor illegible) | Bayesian time | LS-SVM CV acc. | LS-SVM CV time | SVM best acc. | SVM CV acc. | SVM CV time |
|---|---|---|---|---|---|---|---|---|---|---|
| PASCAL VOC 2007 [8] | G_I | 85.3 | 85.2 | 46.9 | 1.1 | 85.2 | 8.4 | 85.0 | 84.7 | 122.4 |
| PASCAL VOC 2007 [8] | G_P | 74.1 | 73.8 | 38.6 | 1.0 | 74.0 | 8.1 | 74.1 | 73.9 | 144.3 |
| PASCAL VOC 2007 [8] | V | 85.9 | 85.8 | 48.0 | 41.9 | 85.8 | 172.2 | 85.9 | 85.8 | 257.5 |
| PASCAL VOC 2007 [8] | A | 75.2 | 75.0 | 32.5 | 41.7 | 75.0 | 160.4 | 75.3 | 75.2 | 211.1 |
| PASCAL VOC 2012 [9] | G_I | 84.4 | 84.3 | 51.3 | 1.2 | 84.3 | 8.6 | 83.9 | 83.7 | 140.8 |
| PASCAL VOC 2012 [9] | G_P | 73.2 | 72.9 | 40.6 | 1.1 | 73.1 | 8.4 | 73.2 | 73.1 | 170.7 |
| PASCAL VOC 2012 [9] | V | 85.2 | 85.1 | 52.9 | 42.7 | 85.2 | 161.5 | 85.6 | 85.4 | 295.9 |
| PASCAL VOC 2012 [9] | A | 74.1 | 73.9 | 34.3 | 42.7 | 74.0 | 161.8 | 74.4 | 74.3 | 160.7 |
| Caltech 101 [12] | G_I | 90.6 | 90.0 | 37.8 | 1.0 | 89.6 | 6.0 | 91.4 | 85.1 | 325.0 |
| Caltech 101 [12] | G_P | 57.0 | 54.3 | 30.6 | 0.9 | 55.1 | 5.9 | 57.2 | 41.8 | 390.3 |
| Caltech 101 [12] | V | 92.2 | 92.1 | 40.9 | 31.5 | 88.8 | 142.7 | 92.2 | 86.8 | 729.4 |
| Caltech 101 [12] | A | 89.3 | 89.2 | 37.3 | 32.0 | 83.4 | 146.9 | 90.0 | 83.5 | 595.3 |
| Caltech 256 [14] | G_I | 77.8 | 77.2 | 59.9 | 2.3 | 77.8 | 21.8 | 81.2 | 81.2 | 4060.4 |
| Caltech 256 [14] | G_P | 44.9 | 42.6 | 55.9 | 2.2 | 44.9 | 21.2 | 48.6 | 48.6 | 4991.8 |
| Caltech 256 [14] | V | 82.0 | 81.1 | 62.3 | 52.5 | 81.7 | 339.7 | 82.7 | 82.7 | 9653.1 |
| Caltech 256 [14] | A | 69.7 | 68.9 | 58.6 | 52.9 | 69.7 | 336.9 | 72.3 | 72.3 | 5348.6 |
| ImageCLEF [20] | G_I | 49.1 | 48.9 | 20.5 | 1.5 | 48.8 | 37.0 | 47.7 | 47.4 | 1218.6 |
| ImageCLEF [20] | G_P | 47.5 | 47.1 | 20.8 | 1.4 | 47.1 | 36.9 | 47.1 | 46.7 | 1410.5 |
| ImageCLEF [20] | V | 50.7 | 50.3 | 21.3 | 45.9 | 50.4 | 248.5 | 50.4 | 50.1 | 2531.2 |
| ImageCLEF [20] | A | 44.8 | 44.6 | 18.7 | 46.1 | 44.6 | 245.9 | 44.4 | 44.1 | 2140.0 |
| MIT Indoor [22] | G_I | 66.7 | 66.0 | 30.1 | 1.2 | 66.7 | 5.8 | 69.4 | 69.2 | 400.9 |
| MIT Indoor [22] | G_P | 80.0 | 79.9 | 35.2 | 1.1 | 80.0 | 5.8 | 81.1 | 80.4 | 402.5 |
| MIT Indoor [22] | V | 73.2 | 73.1 | 31.1 | 42.6 | 73.2 | 186.8 | 74.7 | 74.7 | 895.5 |
| MIT Indoor [22] | A | 62.0 | 61.1 | 28.6 | 42.2 | 60.5 | 187.4 | 63.1 | 63.1 | 460.9 |
| SUN-397 [32] | G_I | 48.1 | 47.0 | 12.8 | 3.1 | 48.1 | 36.5 | 54.2 | 54.2 | 8739.6 |
| SUN-397 [32] | G_P | 61.1 | 60.1 | 13.2 | 2.9 | 61.1 | 34.4 | 63.3 | 63.3 | 8589.4 |
| SUN-397 [32] | V | 55.0 | 53.7 | 12.9 | 57.4 | 54.9 | 419.8 | 57.1 | 57.1 | 20254.0 |
| SUN-397 [32] | A | 45.4 | 44.9 | 12.7 | 50.8 | 45.4 | 419.0 | 48.6 | 48.6 | 10781.8 |
| CUB-200 [30] | G_I | 65.2 | 64.3 | 15.6 | 1.3 | 64.1 | 11.0 | 67.6 | 56.5 | 1201.9 |
| CUB-200 [30] | G_P | 16.4 | 13.6 | 14.9 | 1.5 | 15.0 | 11.1 | 16.8 | 11.1 | 1664.6 |
| CUB-200 [30] | V | 69.2 | 68.6 | 15.8 | 44.1 | 61.5 | 259.2 | 71.1 | 59.4 | 2776.2 |
| CUB-200 [30] | A | 59.0 | 58.5 | 15.5 | 45.3 | 46.6 | 257.9 | 61.4 | 51.6 | 1645.5 |
| Oxford Flowers [19] | G_I | 85.5 | 84.7 | 21.8 | 0.9 | 82.0 | 5.5 | 87.4 | 72.0 | 198.8 |
| Oxford Flowers [19] | G_P | 55.6 | 51.7 | 19.4 | 0.9 | 51.8 | 5.5 | 57.1 | 32.8 | 234.7 |
| Oxford Flowers [19] | V | 87.5 | 87.1 | 22.5 | 26.9 | 82.1 | 142.2 | 87.6 | 73.4 | 520.9 |
| Oxford Flowers [19] | A | 87.6 | 87.6 | 22.9 | 27.3 | 81.8 | 146.7 | 88.3 | 77.1 | 271.3 |
| UIUC Attributes [11] | G_I | 91.5 | 90.3 | 13.5 | 1.4 | 90.9 | 8.0 | 91.3 | 90.6 | 605.5 |
| UIUC Attributes [11] | G_P | 87.8 | 86.6 | 10.5 | 1.3 | 87.1 | 7.4 | 88.0 | 87.6 | 726.0 |
| UIUC Attributes [11] | V | 92.5 | 91.1 | 14.4 | 43.8 | 92.0 | 186.3 | 92.2 | 91.7 | 1285.4 |
| UIUC Attributes [11] | A | 91.4 | 89.9 | 12.9 | 44.1 | 91.0 | 191.2 | 90.8 | 90.5 | 683.7 |
| Human Attributes [5] | G_I | 76.0 | 75.8 | -74.8 | 1.0 | 75.8 | 5.0 | 74.2 | 74.1 | 70.6 |
| Human Attributes [5] | G_P | 58.7 | 58.4 | -103.1 | 1.0 | 58.0 | 4.8 | 56.9 | 56.5 | 85.5 |
| Human Attributes [5] | V | 75.4 | 75.1 | -76.0 | 40.3 | 75.2 | 124.2 | 73.1 | 72.8 | 131.9 |
| Human Attributes [5] | A | 71.9 | 71.3 | -84.4 | 40.7 | 71.7 | 121.2 | 70.0 | 69.9 | 63.3 |
| Stanford 40 Action [33] | G_I | 70.2 | 69.8 | 100.4 | 1.0 | 69.6 | 11.6 | 69.8 | 69.6 | 211.7 |
| Stanford 40 Action [33] | G_P | 48.3 | 47.6 | 86.5 | 1.1 | 47.9 | 11.4 | 48.2 | 47.7 | 246.2 |
| Stanford 40 Action [33] | V | 75.4 | 75.2 | 109.3 | 41.1 | 75.1 | 142.9 | 75.8 | 75.3 | 418.7 |
| Stanford 40 Action [33] | A | 58.0 | 57.7 | 89.6 | 41.5 | 57.5 | 156.5 | 57.4 | 57.1 | 206.8 |













































































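Table 3. Comparison to existing methods on the 12 benchmark datasets. The best ensembles identified by maximizing evidence through exhaustive search mostly coincide with the oracle combinations, i.e., the ones with the highest accuracy on the test set, which are also found by exhaustive search. The ensembles identified by our greedy search are very similar to the ones found by these exhaustive search methods, and our algorithm consequently performs best on many of the tested datasets. We used three scales {256, 384, 512} as done in [25], where we simply averaged the prediction scores from the three scales.

| Method | VOC07 | VOC12 | CAL101 | CAL256 | CLEF | MIT | SUN | Birds | Flowers | UIUC | Human | Action |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeCAF | | | 86.9 | | | | 38.0 | 65.0 | | | | |
| Zeiler | | 79.0 | 86.5 | 74.2 | | | | | | | | |
| INRIA | 77.7 | 82.8 | | | | | | | | | | |
| KTH-S | 71.8 | | | | | 64.9 | 49.6 | 62.8 | 90.5 | 90.6 | 73.8 | 58.9 |
| KTH-FT | 80.7 | | | | | 71.3 | 56.0 | 67.1 | 91.3 | 91.5 | 74.6 | 66.4 |
| VGG | 89.7 | 89.3 | 92.7 | 86.2 | | | | | | | | |
| Zhang | | | | | | | | 76.4 | | | 79.0 | |
| TUBFI | | | | | 44.3 | | | | | | | |
| G_I | 87.5 | 86.2 | 90.5 | 77.7 | 50.3 | 71.3 | 48.3 | 64.7 | 88.1 | 91.1 | 78.4 | 71.0 |
| G_P | 75.7 | 74.9 | 53.8 | 42.1 | 48.1 | 80.8 | 59.8 | 14.9 | 57.8 | 87.3 | 59.7 | 48.4 |
| V | 88.4 | 87.8 | 93.3 | 83.3 | 52.4 | 77.8 | 56.1 | 69.9 | 91.5 | 91.8 | 79.1 | 77.0 |
| A | 75.0 | 73.9 | 88.3 | 69.7 | 52.3 | 77.5 | 42.4 | 60.7 | 86.7 | 89.9 | 71.3 | 57.7 |
| Oracle (exhaustive), selection | G_I G_P V | G_I G_P V | G_I V A | G_I G_P V A | G_I G_P V A | G_P V A | G_I G_P V A | G_I V A | G_I G_P V A | G_I V A | G_I V A | G_I G_P V |
| Oracle (exhaustive), accuracy | 90.0 | 89.4 | 95.3 | 86.1 | 55.7 | 84.9 | 67.5 | 77.3 | 94.7 | 92.0 | 80.8 | 78.6 |
| Max evid. (exhaustive), selection | G_I G_P V | G_I G_P V | G_I V A | G_I G_P V A | G_I G_P V | G_P V | G_P V | G_I V A | G_I V A | G_I G_P V A | G_I V A | G_I G_P V |
| Max evid. (exhaustive), accuracy | 90.0 | 89.4 | 95.3 | 86.1 | 55.5 | 84.7 | 67.5 | 77.3 | 94.5 | 92.0 | 80.8 | 78.6 |
| Ours (greedy), selection | G_I G_P V | G_I G_P V | G_I V A | G_I G_P V A | G_I G_P V | G_P V | G_P V | G_I V A | G_I V A | G_I G_P V A | G_I V A | G_I V A |
| Ours (greedy), accuracy | 90.0 | 89.4 | 95.3 | 86.1 | 55.5 | 84.7 | 67.5 | 77.3 | 94.5 | 92.0 | 80.8 | 77.8 |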

6. Conclusion 

We described a simple and efficient technique to transfer deep CNN models pre-trained on specific image classification tasks to other tasks. Our approach is based on Bayesian LS-SVM, which combines the Bayesian evidence framework and SVM with a least squares loss. In addition, we presented a faster fixed point update rule for evidence maximization through Aitken's delta-squared process. Our fast Bayesian LS-SVM demonstrated competitive results compared to the standard SVM by selecting a deep CNN model in 12 popular visual recognition problems. We also achieved state-of-the-art performance by identifying a good ensemble of the candidate models through our Bayesian LS-SVM framework.